July 2017

Outline

Somatic vs Germline SNVs


Numbers of somatic SNVs in different cancer types



Source: ICGC Data Portal


Somatic SNV callers typically set the expected mutation rate to be around 5 mutations per megabase, i.e. a total of 15,000 mutations across the genome.

Several factors complicate somatic SNV calling

  • Low cellularity (tumour DNA content)

  • Intra-tumour heterogeneity in which multiple tumour cell populations (subclones) exist

  • Aneuploidy

  • Unbalanced structural variation (deletions, duplications, etc.)


  • Matched normal contaminated with cancer DNA

    • adjacent normal tissue may contain residual disease or early tumour-initiating somatic mutations

    • circulating tumour DNA in blood normals


  • Sequencing errors

  • Alignment artefacts

Mwenifumbo & Marra, Nat Rev Genet. 2013

Issues affecting mutation detection in cancer

In this example the tumour was sequenced to an average depth of 50.


  • Is this sufficient?

  • Consider the 50 observations of our tumour which carries a mutation at this base

Issues affecting mutation detection in cancer

Tumour cellularity


  • In fact the 'tumour' sample has some normal contamination

  • 40% of our reads could easily be from the normal sample

Issues affecting mutation detection in cancer

Tumour heterogeneity


  • The tumour may be heterogeneous, i.e. DNA may have been sampled from a number of subclones

  • The mutation may exist in one or more subclones but not in others

Issues affecting mutation detection in cancer

Copy number


  • The tumour may not be diploid

  • If it is triploid, then a mutation may be in only one in three of the tumour reads

Issues affecting mutation detection in cancer

Uneven coverage


  • 50-fold coverage is the average — biases, e.g. GC content, mean that coverage is not uniform

  • For 10% of the genome we only reach a depth of around 40

Issues affecting mutation detection in cancer

Sampling


  • Finally, remember that we are taking a random sample

  • We may not get lucky!


Credit: Andy Lynch

Allele frequencies

Germline SNVs  AF = 0 (homozygous reference), 0.5 (heterozygous variant) or 1 (homozygous variant)

Somatic SNVs typically exist at a continuous range of variant allele frequencies.

CaVEMan SNV caller

CaVEMan (Cancer Variants through Expectation Maximization) is the somatic SNV caller in the Sanger CGP pipeline


  • Bayesian probabilistic classifier

  • Can make use of copy number profiles and estimate of normal contamination of the tumour if available

  • Considers base quality, lane and read position and orientation to provide more accurate estimates of sequencing error rates within the algorithm

  • Several post-processing filters applied to increase specificity of calls


Caveat — full details of CaVEMan are not explicitly reported anywhere.

Other somatic SNV callers

Tool Approach Source Publication
SomaticSniper Bayesian Washington University Larson et al., 2012
VarScan 2 Fisher's exact Washington University Koboldt et al., 2012
JointSNVmix Bayesian British Columbia Cancer Agency Roth et al., 2012
mutationSeq Machine learning British Columbia Cancer Agency Ding et al., 2012
Strelka Bayesian Illumina Saunders et al., 2012
MuTect Bayesian Broad Institute Cibulskis et al., 2013
qSNP Heuristic University of Queensland Kassahn et al., 2013
VarDict Fisher's exact AstraZeneca Lai et al., 2016
MuSE Markov substitution model (Bayesian) MD Anderson Cancer Center Fan et al., 2016

By no means an exhaustive list but includes some of the most popular somatic SNV callers.

Simple statistical approach


Reference base \(\boldsymbol{C}\)
Normal: 48 reads all supporting reference \(\boldsymbol{\{\ 48C\ \}}\)
Tumour: 50 reads, 7 of which have an alternate \(\boldsymbol{T}\) allele \(\boldsymbol{\{\ 43C,\ 7T\ \}}\)


2 x 2 contingency table


  • Allele fraction of the alternate allele in the tumour, \(AF = 7/50 = 0.14\)

  • Perform a Fisher's exact test to determine whether a variant has a significant difference in AF between the two samples

  • Method used by VarScan and VarDict

Fisher's exact test in R

contingency_table <- t(matrix(c(43, 7, 48, 0), nrow = 2, byrow = TRUE,
                              dimnames = list(c("Tumour", "Normal"), c("Ref", "Alt"))))
contingency_table
##     Tumour Normal
## Ref     43     48
## Alt      7      0
fisher.test(contingency_table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.01254
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.0000000 0.6701136
## sample estimates:
## odds ratio 
##          0

Bayesian statistical approach

Most probabilistic methods for variant calling are based on Bayes' Theorem


\[ P(G \mid D) = \frac{P(D \mid G)P(G)}{P(D)} \]

\(G\) = genotype, e.g. AA, AB, BB

\(D\) = data, i.e. set of sequence reads at the genomic position of interest

\(P(G \mid D)\) is the probability of a given genotype, \(G\), given the observed data.


The probability of observing the given set of sequence reads is the weighted average of the probabilities of observing those reads for each possible genotype:

\[ P(D) = \sum\limits_{i=1}^nP(D \mid G_i)P(G_i) \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P = 10^{-30/10} = 0.001\right)\)

Likelihood of data

\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \mathrm{Probability\ of\ two\ Q30\ errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \mathrm{Probability\ of\ four\ Q30\ errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \mathrm{Probability\ of} \{ C, C, C, C, T, T \} &=& \left(1/2\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]


Priors                    \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)


Probability of observing \(\boldsymbol{\{ 4C, 2T \}}\)

\[ \begin{eqnarray} \boldsymbol{P(D)} &=& P(D \mid CC)(P(CC)\ +\ P(D \mid CT)P(CT)\ +\ P(D \mid TT)P(TT) \\[5pt] &=& 1.66\times 10^{-5} \end{eqnarray} \]

Germline SNV calling (single sample)

Reference \(\boldsymbol{C}\), observe 6 reads \(\boldsymbol{\{\ 4C,\ 2T\ \}}\) all with base quality \(\mathrm{Q} = 30\ \left(P = 10^{-30/10} = 0.001\right)\)

Likelihood of data

\[ \begin{eqnarray} \boldsymbol{P(D \mid CC)} &=& \mathrm{Probability\ of\ two\ Q30\ errors} &=& 10^{-3}\times10^{-3} &=&\ \ 10^{-6} \\[5pt] \boldsymbol{P(D \mid TT)} &=& \mathrm{Probability\ of\ four\ Q30\ errors} &=& \left(10^{-3}\right)^4 &=&\ \ 10^{-12} \\[5pt] \boldsymbol{P(D \mid CT)} &=& \mathrm{Probability\ of} \{ C, C, C, C, T, T \} &=& \left(1/2\right)^6 &=&\ \ 0.0156 \end{eqnarray} \]


Priors                    \(\boldsymbol{P(CC)} = 0.9985,\ \ \boldsymbol{P(CT)} = 0.001,\ \ \boldsymbol{P(TT)} = 0.0005\)

Posterior \[ \boldsymbol{P(CC \mid D)} = \frac{P(D \mid CC)P(CC)}{P(D)} = \frac{10^{-6}\times 0.9985}{1.66\times 10^{-5}} = 0.06 \]


Result\(\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ P(CC \mid D) = 0.06\), \(\ \ \boldsymbol{P(CT \mid D) = 0.94}\), \(\ \ P(TT \mid D) = 3 \times 10^{-11}\)


Source: Heng Li, Broad Institute MPG workshop 2011

Somatic SNV calling

Joint genotypes

\(G^N, G^T\) = genotypes of normal and tumour

Bayesian approach extended to somatic SNVs

Posterior probability of the joint genotype

\[ \begin{eqnarray} \boldsymbol{P(G^N, G^T \mid D^N, D^T)} &\propto& P(D^N, D^T \mid G^N, G^T)P(G^T, G^N) \\[5pt] &\propto& P(D^N \mid D^T, G^N, G^T)P(D^T \mid G^N, G^T)P(G^N, G^T) \end{eqnarray} \]


\(P(D^N \mid D^T, G^N, G^T) = \boldsymbol{P(D^N \mid G^N)}\)

if we assume that the normal sample does not contain any reads sequenced from the tumour, i.e. \(D^N\) independent of \(G^T\) — treat as in the single sample case.


\(\boldsymbol{P(D^T \mid G^N, G^T)}\) — scope to incorporate estimates of tumour purity, copy number, etc.


\(\boldsymbol{P(G^N, G^T)}\) — prior for joint genotype; could treat as independent events, \(P(G^N, G^T) = P(G^N)P(G^T)\), but not realistic since \(T\) and \(N\) samples from same individual so share germline variants.

VCF output file